| Model for | low prediction error | explainable | robust | causal | maintainable | usability(prediction time, model size) |
|---|---|---|---|---|---|---|
| Datathon | yes | - | - | - | - | - |
| Research | yes | yes | yes | maybe | - | - |
| Industry Project | yes | yes | yes | maybe | yes | yes |
Datathon:
Research:
Industry Project:
(bold entries are necessary, the rest are optional)
knitr::opts_chunk$set(echo = TRUE, cache = FALSE, message = FALSE, warning = FALSE, comment = "##")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidymodels)
## Registered S3 method overwritten by 'tune':
## method from
## required_pkgs.model_spec parsnip
## -- Attaching packages -------------------------------------- tidymodels 0.1.3 --
## v broom 0.7.9 v rsample 0.1.0
## v dials 0.0.10 v tune 0.1.6
## v infer 1.0.0 v workflows 0.2.3
## v modeldata 0.1.1 v workflowsets 0.1.0
## v parsnip 0.1.7 v yardstick 0.0.8
## v recipes 0.1.16
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x scales::discard() masks purrr::discard()
## x dplyr::filter() masks stats::filter()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x yardstick::spec() masks readr::spec()
## x recipes::step() masks stats::step()
## * Use tidymodels_prefer() to resolve common conflicts.
library(patchwork)
library(vip)
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
theme_set(theme_bw())
ncores <- 4
https://www.tidymodels.org/packages/
The tidymodels framework is a collection of packages for modeling and machine learning using tidyverse principles.
Here is a selection of packages we will use:
| With the functions from the rsample package, we split our data into different folds for 5-fold cross-validation. |
| We will use the recipes package to transform the data. It creates a data transformation recipe that can be applied to various datasets. |
| We use the parsnip package to define our models. |
| The workflows package creates a workflow for us; it combines the model and the data transformation we want to use. |
| The tune package can be used to tune the hyperparameters of the model (for example the number of trees, if we use a random forest, or the maximum depth of a tree). |
| ... and the dials package creates and manages tuning parameters and parameter grids. |
| The yardstick package provides us with functions to evaluate the performance of our model. |
trainval_data <- read_csv("data/train.csv")
test_data <- read_csv("data/test.csv")
names(trainval_data)
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "1stFlrSF"
## [45] "2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "SalePrice"
Clean the data set
p <- DataExplorer::plot_missing(trainval_data, missing_only = T)
cols_to_remove <- p$data %>% filter(Band %in% c("Remove", "Bad")) %>% pull(feature)
cols_to_remove
## [1] Alley FireplaceQu PoolQC Fence MiscFeature
## 81 Levels: PoolQC MiscFeature Alley Fence FireplaceQu ... SalePrice
data <- bind_rows(
trainval = trainval_data,
test = test_data,
.id = "source"
)
data <- data %>% replace_na(
list(
Alley = "No Alley Access",
BsmtQual = "No Basement",
BsmtCond = 'No Basement',
BsmtExposure = "No Basement",
BsmtFinType1 = "No Basement",
BsmtFinType2 = "No Basement",
FireplaceQu = "No Fireplace",
GarageType = "No Garage",
GarageFinish = "No Garage",
GarageQual = "No Garage",
GarageCond = "No Garage",
PoolQC = "No Pool",
Fence = "No Fence",
MiscFeature = "None"
)
)
trainval_data <- data %>%
filter(source == "trainval") %>%
select(-source)
test_data <- data %>%
filter(source == "test") %>%
select(-source)
p <- DataExplorer::plot_missing(trainval_data, missing_only = T)
cols_to_remove <- p$data %>%
filter(Band %in% c("Remove", "Bad")) %>%
pull(feature)
cols_to_remove
## factor(0)
## 81 Levels: LotFrontage GarageYrBlt MasVnrType MasVnrArea Electrical ... SalePrice
Feature engineering (and feature selection) is often the most important part of the modeling process.
ggplot(trainval_data, aes(x=SalePrice)) +
geom_histogram(aes(y=..density..), colour = "black", alpha = .2) +
geom_density(fill="red", alpha=.2) +
labs(x = "", y = "")
Here, prices are strictly positive (above zero). Should a house be twice as expensive if it's twice as large?
If the dependence on our variables is multiplicative (i.e. the price doubles when the size doubles, and a bathroom increases the price of the house by a certain percentage, so the absolute increase is higher for an expensive house than for a cheap one), then we want to use log-linear regression or Poisson regression.
Example:
\(Y = a_1 X_1 + a_2 X_2 + a_3 X_3 + ... + a_n X_n\) - features are additive
\(log(Y) = b_1 log(X_1) + b_2 log(X_2) + b_3 log(X_3) + ... + b_n log(X_n)\) - features are multiplicative!
By using a log-transformation on your (numerical) features, you can change how they influence the model outcome.
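A quick self-contained sketch of why this works (simulated data, not the housing set): if the true relationship is multiplicative, a linear fit on the log-transformed variables recovers the exponents as coefficients.

```r
# Simulated multiplicative relationship: Y = X1^2 * X2^0.5 (plus small noise)
set.seed(1)
x1 <- runif(500, 1, 10)
x2 <- runif(500, 1, 10)
y  <- x1^2 * x2^0.5 * exp(rnorm(500, sd = 0.01))

# On the log scale the model is linear, and the exponents
# show up as the regression coefficients (~2 and ~0.5):
fit <- lm(log(y) ~ log(x1) + log(x2))
round(coef(fit), 2)
```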
New target variable: log(SalePrice)
The independent variables (features) will be transformed later, in the recipe.
trainval_data <- trainval_data %>% mutate(logSalePrice = log(SalePrice))
# hist
g1 <- ggplot(trainval_data, aes(x = logSalePrice)) +
geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
geom_density(alpha = .2, fill = "#FF6666") +
labs(x = "", y = "")
# boxplot
g2 <- ggplot(trainval_data, aes(y = logSalePrice)) +
geom_boxplot(aes(x = ""), colour = "black", fill = "white") +
coord_flip() +
labs(x = "", y = "")
# qqplot
g3 <- ggplot(trainval_data, aes(sample = logSalePrice)) +
stat_qq() +
stat_qq_line() +
labs(x = "", y = "")
g3 | g1 / g2
| transformation | example |
|---|---|
| non-linear transform | example: log-transform on both target and features, if features are multiplicative rather than additive; only on the features, if the target depends logarithmically on the feature |
| indicator function | whether a feature is present or not, i.e. if we’re only interested in whether the house has a kitchen, but not how many, or the increase in the number of kitchens is not important |
| f(feature1, feature2) | if two features interact, we can create a new feature from the interaction, for example the average number of square feet per room: total living surface / number of rooms |
| binning | if we are not interested in very fine-grained information, we can bin the data, for example the age of the house in steps of 20 years |
| decorrelation | if variables are highly correlated, we can think about decorrelating them, either by using PCA (and variable reduction) or by creating variables that are less correlated |
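A few of these transformations sketched in base R, on a toy data frame with made-up values mimicking some of the housing columns:

```r
# Toy stand-in for the housing data (hypothetical values)
df <- data.frame(
  Fireplaces   = c(0, 1, 2, 0),
  GrLivArea    = c(1200, 1500, 2400, 900),
  TotRmsAbvGrd = c(5, 6, 8, 4),
  YearBuilt    = c(1921, 1965, 2003, 1988)
)

# indicator: does the house have a fireplace at all?
df$HasFireplace <- as.integer(df$Fireplaces > 0)

# interaction feature: average square feet per room
df$SqftPerRoom <- df$GrLivArea / df$TotRmsAbvGrd

# binning: construction year in steps of 20 years
df$YearBuiltBin <- cut(df$YearBuilt, breaks = seq(1900, 2020, by = 20), dig.lab = 4)
```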
trainval_data %>%
select(where(is.numeric)) %>%
cor( method="kendall", use="pairwise.complete.obs") %>%
corrplot::corrplot()
Especially if we use models that are not robust, like linear regression, outliers can severely skew our results. Sometimes it can be good to remove outliers from the data before training the model.
g1 / g2
# identify points in the extreme 5% of the observed data
trainval_data %>% filter(
logSalePrice > quantile(logSalePrice, .975) | # OR
logSalePrice < quantile(logSalePrice, .025)) %>%
select(SalePrice, Alley, LotArea, PoolArea, MiscFeature)
## # A tibble: 72 x 5
## SalePrice Alley LotArea PoolArea MiscFeature
## <dbl> <chr> <dbl> <dbl> <chr>
## 1 68500 No Alley Access 6324 0 None
## 2 40000 Pave 8500 0 None
## 3 385000 No Alley Access 50271 0 None
## 4 438780 No Alley Access 13682 0 None
## 5 79000 No Alley Access 9600 0 None
## 6 412500 No Alley Access 13688 0 None
## 7 501837 No Alley Access 17423 0 None
## 8 475000 No Alley Access 22950 0 None
## 9 386250 No Alley Access 13472 0 None
## 10 403000 No Alley Access 15138 0 None
## # ... with 62 more rows
# to remove these, invert the process;
# trainval_data <- trainval_data %>% filter(
# logSalePrice < quantile(logSalePrice, .975) & # AND
# logSalePrice > quantile(logSalePrice, .025))
You may also want to look at outliers in the features.
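A common heuristic for flagging feature outliers is the 1.5 × IQR rule; a base-R sketch on a hypothetical vector of lot areas:

```r
# hypothetical lot areas with one extreme value
x <- c(7000, 8500, 9600, 10500, 215000)

q   <- quantile(x, c(.25, .75))
iqr <- q[2] - q[1]
# anything more than 1.5 * IQR outside the quartiles is flagged
outliers <- x[x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr]
outliers  # 215000
```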
| With the functions from the rsample package, we split our data into different folds for 5-fold cross-validation. We can split it in such a way that the distribution of our target variable is similar in all folds. |
In order to evaluate your model quality, you need a training and a validation set. You train the model on the training set and evaluate it on the validation set. We split our training data so that 80% of it will be the training set, and 20% the validation set.
set.seed(24)
data_split <- initial_split(trainval_data, prop = 4/5, strata = 'logSalePrice', breaks = 10)
train_data <- training(data_split)
val_data <- testing(data_split)
cat('train+validation data size is:',dim(trainval_data),'\n')
## train+validation data size is: 1460 82
cat('train data size is:',dim(train_data),'\n')
## train data size is: 1165 82
cat('validation data size is:',dim(val_data),'\n')
## validation data size is: 295 82
cat('test data size is:',dim(test_data))
## test data size is: 1459 81
set.seed(42)
SalePrice_vfold <- vfold_cv(train_data, v = 5, strata = SalePrice)
| We will use the recipes package to transform the data. It creates a data transformation recipe that can be applied to various datasets. A recipe is an instruction for how to change a dataset. It can be applied to the training data before modeling, or to the test data before prediction. |
| recipe | pipeable sequences of feature engineering steps to get your data ready for modeling |
|---|---|
| update_role | update the role of the independent variables (predictor or identifier: identifiers are not used for prediction) |
| step_rm | removes columns from the data frame |
| step_unknown | assign missing categories to “unknown” (or a name of your choice) |
| step_log | logarithmic transformation |
| step_normalize | all numeric variables are normalized |
| step_other | factors are grouped together into the “other” class if they are less frequent than the threshold |
| step_novel | if previously unseen factor levels are encountered, they are assigned to the “new” level |
| step_impute_knn | missing values are imputed with the k-nearest-neighbors algorithm |
| step_dummy | categorical variables are one-hot encoded |
| step_nzv | variables with near-zero variance are removed |
There are many more options, get all information here: https://recipes.tidymodels.org/reference/index.html
SalePrice_recipe <- recipe(
# set the target(dependent) and the feature (independent) variables
train_data, logSalePrice ~ .) %>%
#update the role of the independent variables (predictor or identifier)
update_role(Id, new_role = "ID") %>%
# remove columns we don't need
step_rm(Id, SalePrice) %>%
step_rm(any_of(cols_to_remove)) %>%
# # giving missing values a label
# step_unknown(c('BsmtQual','BsmtCond'), new_level='No Basement') %>%
# apply a log transformation to all numeric predictors
step_log(all_numeric(),-all_outcomes(), offset = 1) %>%
# normalize all numeric predictors
step_normalize(all_numeric(),-all_outcomes()) %>%
# group together factor levels that occur less than the threshold
step_other(all_nominal(), -all_outcomes(), threshold = 0.01) %>%
# if previously unseen factor levels are encountered, they are given the "new" level
step_novel(all_predictors(), -all_numeric()) %>%
# impute missing variables with k nearest neighbours
step_impute_knn(all_predictors()) %>%
# remove predictors with near-zero variance
step_nzv(all_predictors(), freq_cut=995/5) %>%
# add dummy variables for all nominal predictors
step_dummy(all_nominal(), -all_outcomes())
For which columns does kNN imputation make sense? How should the other columns be imputed?
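kNN imputation is most defensible for numeric features where similar houses plausibly have similar values (e.g. LotFrontage). Where missingness actually means "feature absent", the explicit labels from replace_na() above are better; and simple median/mode imputation is a robust fallback. A base-R sketch of the latter, on hypothetical vectors (not the actual data):

```r
# median imputation for a numeric column
x <- c(60, 80, NA, 70, NA, 85)
x[is.na(x)] <- median(x, na.rm = TRUE)  # the NAs become 75

# mode (most frequent level) imputation for a categorical column
g <- c("Attchd", NA, "Detchd", "Attchd", NA)
g[is.na(g)] <- names(which.max(table(g)))  # the NAs become "Attchd"
```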
The recipe is trained (its transformation parameters are estimated) on the training data with the prep() method:
SalePrice_recipe2 <- prep(SalePrice_recipe, training=train_data)
tidy(SalePrice_recipe)
## # A tibble: 9 x 6
## number operation type trained skip id
## <int> <chr> <chr> <lgl> <lgl> <chr>
## 1 1 step rm FALSE FALSE rm_XMBSP
## 2 2 step rm FALSE FALSE rm_dex5l
## 3 3 step log FALSE FALSE log_9KhLu
## 4 4 step normalize FALSE FALSE normalize_eBkTV
## 5 5 step other FALSE FALSE other_QKTAI
## 6 6 step novel FALSE FALSE novel_EXG5i
## 7 7 step impute_knn FALSE FALSE impute_knn_PpuSA
## 8 8 step nzv FALSE FALSE nzv_w2ycB
## 9 9 step dummy FALSE FALSE dummy_EUsSZ
tidy(SalePrice_recipe2)
## # A tibble: 9 x 6
## number operation type trained skip id
## <int> <chr> <chr> <lgl> <lgl> <chr>
## 1 1 step rm TRUE FALSE rm_XMBSP
## 2 2 step rm TRUE FALSE rm_dex5l
## 3 3 step log TRUE FALSE log_9KhLu
## 4 4 step normalize TRUE FALSE normalize_eBkTV
## 5 5 step other TRUE FALSE other_QKTAI
## 6 6 step novel TRUE FALSE novel_EXG5i
## 7 7 step impute_knn TRUE FALSE impute_knn_PpuSA
## 8 8 step nzv TRUE FALSE nzv_w2ycB
## 9 9 step dummy TRUE FALSE dummy_EUsSZ
The trained recipe is applied to a dataset with the bake() method:
bake(SalePrice_recipe2, train_data)
## # A tibble: 1,165 x 238
## MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.990 0.244 0.338 -1.61 -0.461 -0.133 -0.876
## 2 -0.568 -0.267 -0.666 -1.61 0.437 -1.46 -1.71
## 3 0.631 -0.775 -0.110 -1.61 -1.52 -1.70 -1.71
## 4 0.990 -0.0427 -0.752 -1.61 -0.461 -0.528 -1.47
## 5 0.729 -0.267 -0.422 -0.725 1.22 -1.70 0.532
## 6 -0.568 -0.948 -1.26 -1.61 0.437 -0.860 -1.71
## 7 1.99 -3.16 -3.25 -1.61 -0.461 0.0641 -0.583
## 8 0.152 1.30 -0.117 -2.70 -4.50 -1.86 -0.145
## 9 -1.13 0.710 -0.110 -0.725 -2.82 -0.330 -1.17
## 10 -1.13 0.282 0.360 -1.61 -0.461 -0.199 -0.974
## # ... with 1,155 more rows, and 231 more variables: MasVnrArea <dbl>,
## # BsmtFinSF1 <dbl>, BsmtUnfSF <dbl>, TotalBsmtSF <dbl>, 1stFlrSF <dbl>,
## # 2ndFlrSF <dbl>, GrLivArea <dbl>, BsmtFullBath <dbl>, BsmtHalfBath <dbl>,
## # FullBath <dbl>, HalfBath <dbl>, BedroomAbvGr <dbl>, KitchenAbvGr <dbl>,
## # TotRmsAbvGrd <dbl>, Fireplaces <dbl>, GarageYrBlt <dbl>, GarageCars <dbl>,
## # GarageArea <dbl>, WoodDeckSF <dbl>, OpenPorchSF <dbl>, EnclosedPorch <dbl>,
## # MoSold <dbl>, YrSold <dbl>, logSalePrice <dbl>, MSZoning_RH <dbl>, ...
| We use the parsnip package to define our models |
Here we will use Lasso, but there are other models you can use as well. Have a look here: https://parsnip.tidymodels.org/reference/index.html
Regularized linear regression
get information on linear regression models here: https://parsnip.tidymodels.org/reference/linear_reg.html
and for this one in particular here: https://parsnip.tidymodels.org/reference/details_linear_reg_glmnet.html
Lasso_model <- parsnip::linear_reg(
mixture = 1, #here we specify that we want a Lasso model, with L1 regularization.
penalty = tune()) %>% #the penalty will be specified by tuning the model
set_engine("glmnet") %>%
translate()
We add a model and a data transformation recipe to our workflow.
Then, we can fit the model together with the transformed data using the workflow.
| The workflows package creates a workflow for us; it defines the model and the data transformation we want to use. |
Workflow_SalePrice_lasso <-
workflow() %>%
add_model(Lasso_model) %>%
add_recipe(SalePrice_recipe)
| The tune package can be used to tune the hyperparameters of the model (for example the number of trees, if we use a random forest, the learning rate, or the maximum depth of a tree) and the dials package creates and manages tuning parameters and parameter grids. |
Select a set of hyperparameters to explore. We use the dials package, which provides us with default values that we can also adjust.
Have a look here: https://dials.tidymodels.org/reference
Tune the hyperparameters, for example with grid search or Bayesian optimization.
These algorithms use 5-fold cross-validation to choose the hyperparameters for which the model performs best (across all folds).
Look at the optimal hyperparameter values. If they happen to be at the border of our parameter range, it might be good to extend the range and repeat the hyperparameter tuning.
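The cross-validation logic itself is simple enough to write by hand; here is a base-R sketch on simulated data (illustrative only — tune_grid()/tune_bayes() below do this, plus the hyperparameter search, for us):

```r
# Simulated data with a known linear relationship
set.seed(1)
d <- data.frame(x = runif(100))
d$y <- 3 * d$x + rnorm(100, sd = 0.1)

k <- 5
fold <- sample(rep(1:k, length.out = nrow(d)))  # random fold assignment
rmse_per_fold <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = d[fold != i, ])        # train on k-1 folds
  pred <- predict(fit, newdata = d[fold == i, ])  # predict the held-out fold
  sqrt(mean((d$y[fold == i] - pred)^2))
})
mean(rmse_per_fold)  # cross-validated RMSE, ~0.1 with this noise level
```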
We define a grid of parameters; each parameter combination will be used to fit a model.
For a small set of parameters this is a good option; for a large set, it is very impractical.
#setting the range for the parameter we want to explore
parameter_grid <- tibble(penalty = 10 ^ seq(-4,-1, length.out = 30))
lasso_tune_grid <-
Workflow_SalePrice_lasso %>%
tune_grid(
SalePrice_vfold,
grid = parameter_grid,
control = control_grid(save_pred = TRUE),
metrics = metric_set(rmse)
)
autoplot(lasso_tune_grid)
Bayesian optimization (https://en.wikipedia.org/wiki/Bayesian_optimization):
At each iteration step, the most promising set of parameters for improving the model is chosen.
It uses a mixture of exploration (trying new regions of the parameter space) and exploitation (refining around parameter values that already performed well).
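To make the explore/exploit idea concrete, here is a deliberately simplified toy (NOT a real GP-based Bayesian optimizer like the one tune_bayes() uses): a quadratic fit stands in for the Gaussian-process surrogate, and the distance to already-evaluated points stands in for the predictive uncertainty.

```r
f <- function(x) (x - 0.3)^2              # toy "expensive" objective to minimize
xs <- c(0, 0.5, 1); ys <- f(xs)           # initial design points
grid <- seq(0, 1, by = 0.01)
for (i in 1:10) {
  fit   <- lm(ys ~ poly(xs, 2, raw = TRUE))            # surrogate model
  mu    <- predict(fit, data.frame(xs = grid))         # predicted objective
  sigma <- sapply(grid, function(g) min(abs(g - xs)))  # crude "uncertainty"
  xnew  <- grid[which.min(mu - sigma)]  # acquisition: exploit low mu, explore high sigma
  xs <- c(xs, xnew); ys <- c(ys, f(xnew))              # evaluate and add the point
}
xs[which.min(ys)]  # close to the true minimizer 0.3
```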
set.seed(42)
lasso_params <- parameters(penalty())
lasso_tune_bayes <- Workflow_SalePrice_lasso %>%
tune_bayes(
resamples = SalePrice_vfold,
param_info = lasso_params,
# initial = ?,
iter = 30,
metrics = metric_set(rmse),
control = control_bayes(
no_improve = 5,
save_pred = T,
verbose = T
)
)
# doParallel::stopImplicitCluster()
lasso_tune_grid %>% collect_metrics()
## # A tibble: 30 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.0001 rmse standard 0.138 5 0.00818 Preprocessor1_Model01
## 2 0.000127 rmse standard 0.138 5 0.00818 Preprocessor1_Model02
## 3 0.000161 rmse standard 0.138 5 0.00815 Preprocessor1_Model03
## 4 0.000204 rmse standard 0.137 5 0.00812 Preprocessor1_Model04
## 5 0.000259 rmse standard 0.137 5 0.00810 Preprocessor1_Model05
## 6 0.000329 rmse standard 0.136 5 0.00809 Preprocessor1_Model06
## 7 0.000418 rmse standard 0.136 5 0.00813 Preprocessor1_Model07
## 8 0.000530 rmse standard 0.135 5 0.00817 Preprocessor1_Model08
## 9 0.000672 rmse standard 0.134 5 0.00819 Preprocessor1_Model09
## 10 0.000853 rmse standard 0.134 5 0.00823 Preprocessor1_Model10
## # ... with 20 more rows
lasso_tune_bayes %>%
collect_metrics() %>%
select(-.iter) %>%
add_column(tune = 'bayes') %>%
rbind(lasso_tune_grid %>%
collect_metrics() %>%
add_column(tune = 'grid')) %>%
ggplot(aes(x = penalty, y = mean, color = .metric)) +
geom_point(aes(shape = tune, size = tune)) +
geom_line() +
ylab("RMSE") +
scale_x_log10(labels = scales::label_number()) +
scale_size_manual(values = c(4, 2)) +
  ggtitle("Dependence of the model's prediction error (RMSE) on the penalty hyperparameter")
lasso_tune_grid %>% show_best("rmse", n = 1)
## # A tibble: 1 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00356 rmse standard 0.132 5 0.00788 Preprocessor1_Model16
lasso_tune_bayes %>% show_best("rmse", n = 1)
## # A tibble: 1 x 8
## penalty .metric .estimator mean n std_err .config .iter
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <int>
## 1 0.00216 rmse standard 0.132 5 0.00831 Iter3 3
#choose the hyperparameters that worked best
lasso_best_params <- select_best(lasso_tune_grid, "rmse")
#update the hyper parameters of the model
SalePrice_final_model <- finalize_model(Lasso_model, lasso_best_params)
# update the workflow with the updated model
Workflow_SalePrice_lasso <- Workflow_SalePrice_lasso %>% update_model(SalePrice_final_model)
| The yardstick package provides us with functions to evaluate the performance of our model |
(the same error metric that was used to find the best hyperparameters)
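For reference, the two metrics reported below are easy to compute by hand; yardstick's rmse() and rsq() follow the same formulas (hypothetical numbers here, not the model's output):

```r
truth    <- c(11.8, 12.1, 12.5, 11.9)   # hypothetical log sale prices
estimate <- c(11.9, 12.0, 12.4, 12.0)   # hypothetical predictions

rmse <- sqrt(mean((truth - estimate)^2))  # root mean squared error: 0.1
rsq  <- cor(truth, estimate)^2            # squared correlation: ~0.92
```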
set.seed(42)
lasso_fit_rs <-
Workflow_SalePrice_lasso %>%
fit_resamples(SalePrice_vfold)
collect_metrics(lasso_fit_rs)
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 0.132 5 0.00788 Preprocessor1_Model1
## 2 rsq standard 0.893 5 0.0132 Preprocessor1_Model1
final_res_lasso <- last_fit(Workflow_SalePrice_lasso, split = data_split)
final_res_lasso$.metrics[[1]]
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.116 Preprocessor1_Model1
## 2 rsq standard 0.914 Preprocessor1_Model1
SalePrice_lasso_fit <- fit(Workflow_SalePrice_lasso, data = train_data)
pred_lasso <-
predict(SalePrice_lasso_fit, val_data) %>%
mutate(model = "regression",
.pred = exp(.pred)) %>%
bind_cols(val_data %>% select(SalePrice))
g3 <-
pred_lasso %>%
ggplot(aes(y = .pred, x = SalePrice))+
geom_point()+
geom_abline(intercept = 0, col = "red")+
ggtitle('Lasso')
g4 <-
pred_lasso %>%
select(.pred, SalePrice) %>%
gather(key, value) %>%
ggplot(aes(x = value, color = key, fill = key)) +
geom_density(alpha=.2)+
labs(x = "", y = "")
g3/g4
tidy( SalePrice_lasso_fit ) %>%
arrange(desc(estimate)) %>%
head(n=10)
## # A tibble: 10 x 3
## term estimate penalty
## <chr> <dbl> <dbl>
## 1 (Intercept) 11.9 0.00356
## 2 Neighborhood_StoneBr 0.139 0.00356
## 3 GrLivArea 0.129 0.00356
## 4 Neighborhood_NridgHt 0.120 0.00356
## 5 Neighborhood_Crawfor 0.101 0.00356
## 6 Neighborhood_NoRidge 0.0893 0.00356
## 7 SaleType_New 0.0891 0.00356
## 8 OverallQual 0.0687 0.00356
## 9 Exterior1st_BrkFace 0.0685 0.00356
## 10 Neighborhood_Somerst 0.0648 0.00356
final_fit <- fit(Workflow_SalePrice_lasso, data = trainval_data)
pred_lasso <-
predict(final_fit, test_data) %>%
transmute(SalePrice = exp(.pred)) %>% # here we need to transform the prediction again, as we were working with logarithmic prices.
bind_cols(test_data %>% select(Id)) %>%
write_csv("submission.csv")
Xgb_model <-
parsnip::boost_tree(
trees = tune(), # the hyperparameters will be specified later, when we do hyperparameter tuning
learn_rate = tune(), # but you can also set some fixed here.
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
sample_size = tune(),
mtry = tune()
) %>%
set_mode("regression") %>%
set_engine("xgboost", nthread = ncores)
#This function only defines what type of model is being fit.
#Once an engine is specified, the method to fit the model is also defined.
Workflow_SalePrice_xgb <- workflow() %>%
add_model(Xgb_model) %>%
add_recipe(SalePrice_recipe) # re-using the same data pre-processing recipe
To get a list of all hyperparameters and what they do, look here: https://parsnip.tidymodels.org/reference/boost_tree.html or here: https://parsnip.tidymodels.org/reference/details_boost_tree_xgboost.html
xgboost_params <- parameters(
trees(),
learn_rate(),
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), train_data) # the parameter range of mtry (how many features are chosen at random for each tree)
) # depends on the data set (the number of features). With "finalize",
# we let the function determine the upper bound given the data set.
xgboost_params <- xgboost_params %>%
update(trees = trees(c(100, 500))) #parameter tuning ranges can also be updated later on.
Since we have many hyperparameters we want to optimize, Bayesian optimization is a better choice here.
When we tune the hyperparameters, we use 5-fold cross validation. The model is trained on 4 of the 5 folds and evaluated on the 5th fold. The algorithm then chooses the hyperparameters for which the model performs best across all folds. What we need:
set.seed(42)
# this may take a while...
# feel free to set the `initial`, `iter`, and `no_improve` parameters lower
# for a faster (but less thorough) search.
xgboost_tune <- Workflow_SalePrice_xgb %>%
tune_bayes(
resamples = SalePrice_vfold,
param_info = xgboost_params,
initial = 10,
iter = 30,
metrics = metric_set(rmse, mape),
control = control_bayes(
no_improve = 5,
save_pred = T,
verbose = T
)
)
# doParallel::stopImplicitCluster()
autoplot(xgboost_tune)
xgb_best_params <- select_best(xgboost_tune, "mape")
SalePrice_final_model <- finalize_model(Xgb_model, xgb_best_params)
Workflow_SalePrice_xgb <- Workflow_SalePrice_xgb %>% update_model(SalePrice_final_model)
final_res_xgb <- last_fit(Workflow_SalePrice_xgb, split = data_split)
final_res_xgb$.metrics[[1]]
## # A tibble: 2 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 rmse standard 0.120 Preprocessor1_Model1
## 2 rsq standard 0.908 Preprocessor1_Model1
SalePrice_xgb_fit <- fit(Workflow_SalePrice_xgb, data = train_data)
pred_xgb <-
predict(SalePrice_xgb_fit, val_data) %>%
mutate(model = "XGBoost",
.pred = exp(.pred)) %>%
bind_cols(val_data %>% select(SalePrice))
g1 <-
pred_xgb %>%
ggplot(aes(y = .pred, x = SalePrice)) +
geom_point() +
geom_abline(intercept = 0, col = "red") +
ggtitle('XGBoost')
g2 <-
pred_xgb %>%
select(.pred, SalePrice) %>%
gather(key, value) %>%
ggplot(aes(x = value, color = key, fill = key)) +
geom_density(alpha = .2) +
labs(x = "", y = "")
g1 / g2
SalePrice_xgb_fit %>%
extract_fit_parsnip() %>%
vip()